Executive Summary

The purpose of this project was to fit a series of regression models to a dataset of housing features, with sale price as the response variable. Two Bayesian linear models were constructed in R using JAGS: the first used 3 features to predict sale prices, while the second iteration used 4. A number of evaluation metrics, including the deviance information criterion (DIC), were generated to gauge the accuracy of each model. It was determined that incorporating the 4th feature reduced the DIC and, hence, produced a better fit for the data.

Introduction

The Ames Housing dataset, which is available on Kaggle.com, was compiled by Dean De Cock for use in data science education. It’s an incredible dataset resource for data scientists and statisticians looking for a modernized and expanded version of the often-cited Boston Housing dataset.

The subject dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, along with the sale price of each home. For the purposes of this project, we will construct two models that leverage a subset (specifically, 3-4) of the most informative explanatory variables to understand their ability to accurately predict sale prices for homes given their features.

The features to be included are LotArea (lot size in square feet), HouseStyle (the style of the dwelling, a categorical variable), and OverallCond (a rating of the home's overall condition).

We will create an additional iteration of the model that incorporates the feature TotRmsAbvGrd, the total number of rooms above grade.

Load and Explore the Dataset

dat = read.csv(file="data_files/housing-price-data.csv", header=TRUE)

# Subset the raw data to features of interest and the response variable, SalePrice
dat1=dat[,c("LotArea","HouseStyle","OverallCond","GrLivArea","SalePrice")]

# Code the categorical variable `HouseStyle`.
dat1$HouseStyle_coded = factor(dat1$HouseStyle)
newdat1 = dat1[,c("LotArea","HouseStyle_coded","OverallCond","GrLivArea","SalePrice")]
head(newdat1)
##   LotArea HouseStyle_coded OverallCond GrLivArea SalePrice
## 1    8450           2Story           5      1710    208500
## 2    9600           1Story           8      1262    181500
## 3   11250           2Story           5      1786    223500
## 4    9550           2Story           5      1717    140000
## 5   14260           2Story           5      2198    250000
## 6   14115           1.5Fin           5      1362    143000

Now, let’s look at the relationships among the features and their distributions:

library(Hmisc)
pairs(newdat1)

It looks like most, if not all, of the numeric features have a somewhat linear relationship with SalePrice.

hist.data.frame(newdat1)

It also appears that all of our non-categorical variables (including the response variable, SalePrice) have a somewhat normal distribution.

Postulate a Model

Since each of the features appears to have a roughly linear relationship with SalePrice, let's start off by postulating a hierarchical JAGS linear model on the descriptor features. We will use non-informative prior distributions (i.e., large \(\sigma^2\)) for the model coefficients. The model will take the form of

\(y_i \sim N(\mu_i,\sigma^2),\space \space \space \mu_i = \beta_0+\beta_1 x_{1i}+...+\beta_k x_{ki}, \space \space \space \beta_j \sim N(0, 1\times10^6)\)

where \(k\) is the number of descriptor variables in the data set and \(i = 1,\ldots,n\) indexes the observations.

Also, \(y_i\space| \space x_i,\beta,\sigma^2 \stackrel{ind}{\sim}N(\beta_0+\beta_1 x_{1i}+...+\beta_k x_{ki},\sigma^2)\),

where the noninformative prior for \(\sigma^2\) is modeled using an \(InverseGamma(\alpha,\beta)\) distribution.
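As a quick sanity check on this parameterization, the following base-R sketch simulates the precision from the same Gamma(shape = 5/2, rate = 25) prior used in the JAGS code below and confirms that its reciprocal behaves like the corresponding inverse-gamma variance:

```r
# If prec ~ Gamma(a, b) (shape a, rate b), then sig2 = 1/prec follows an
# InverseGamma(a, b) distribution, whose mean is b/(a - 1) when a > 1.
set.seed(42)
a <- 5/2.0        # shape used for `prec` in the JAGS model
b <- 5*10.0/2.0   # rate used for `prec` in the JAGS model
prec <- rgamma(1e5, shape = a, rate = b)
sig2 <- 1/prec
c(simulated_mean = mean(sig2), theoretical_mean = b/(a - 1))
```

The simulated mean should land close to the theoretical inverse-gamma mean of \(b/(a-1) \approx 16.7\).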

library("rjags")
newdat1 = na.omit(newdat1)

mod1_jags_string = " model {
    for (i in 1:n) {
        y[i] ~ dnorm(mu[i], prec)
        mu[i] = b0 + b[1]*LotArea[i] + b[2]*OverallCond[i]+ b[3]*HouseStyle_coded1.5Unf[i]
        +b[4]*HouseStyle_coded1Story[i]+b[5]*HouseStyle_coded2.5Fin[i]+ b[6]*HouseStyle_coded2.5Unf[i]
        + b[7]*HouseStyle_coded2Story[i]+ b[8]*HouseStyle_codedSFoyer[i]+ b[9]*HouseStyle_codedSLvl[i]
    }
    b0 ~ dnorm(0.0, 1.0/1.0e6)
    for (i in 1:9) {
        b[i] ~ dnorm(0.0, 1.0/1.0e6)
    }
    prec ~ dgamma(5/2.0, 5*10.0/2.0)
    sig2 = 1.0 / prec
    sig = sqrt(sig2)
} "
data_jags = list(y=newdat1$SalePrice,LotArea=newdat1$LotArea,OverallCond=newdat1$OverallCond,
                 HouseStyle_coded1.5Unf=as.numeric(newdat1$HouseStyle_coded=="1.5Unf"),HouseStyle_coded1Story=as.numeric(newdat1$HouseStyle_coded=="1Story"),
                 HouseStyle_coded2.5Fin=as.numeric(newdat1$HouseStyle_coded=="2.5Fin"),HouseStyle_coded2.5Unf=as.numeric(newdat1$HouseStyle_coded=="2.5Unf"),
                 HouseStyle_coded2Story=as.numeric(newdat1$HouseStyle_coded=="2Story"),HouseStyle_codedSFoyer=as.numeric(newdat1$HouseStyle_coded=="SFoyer"),
                 HouseStyle_codedSLvl=as.numeric(newdat1$HouseStyle_coded=="SLvl"),n=nrow(newdat1)) 
params1 = c("b0","b", "sig")
inits1 = function() {
    inits = list("b0"=rnorm(1,0.0,100.0), "b"=rnorm(9,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
}
mod1_jags = jags.model(textConnection(mod1_jags_string), data=data_jags, inits=inits1, n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1460
##    Unobserved stochastic nodes: 11
##    Total graph size: 17039
## 
## Initializing model

Fit the Model using the Markov Chain Monte Carlo (MCMC) Sampler

update(mod1_jags, 1000) # burn-in

mod1_jags_sim = coda.samples(model=mod1_jags,
                        variable.names=params1,
                        n.iter=5000)

mod1_jags_csim = do.call(rbind, mod1_jags_sim) # combine multiple chains

Check the Model by Examining Convergence Diagnostics

# gelman.diag(mod1_jags_sim)
# autocorr.diag(mod1_jags_sim)
# autocorr.plot(mod1_jags_sim)
# effectiveSize(mod1_jags_sim)
plot(mod1_jags_sim)

We can observe that the above traces for the \(b[i]\)s converge and that the posterior densities are approximately normal; the diffuse priors we assumed for the coefficients posed no problems for the sampler. With convergence established, we can obtain a posterior summary of the parameters in our model.
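For reference, the statistic behind the commented-out gelman.diag call above compares between-chain and within-chain variance. A simplified, self-contained base-R sketch (omitting coda's degrees-of-freedom correction):

```r
# Minimal sketch of the Gelman-Rubin statistic (R-hat) for several chains
# of equal length: compares between-chain variance B to within-chain
# variance W; values near 1 suggest the chains have converged.
rhat <- function(chains) {          # `chains` is a list of numeric vectors
  n <- length(chains[[1]])
  means <- sapply(chains, mean)
  B <- n * var(means)               # between-chain variance
  W <- mean(sapply(chains, var))    # within-chain variance
  var_plus <- (n - 1)/n * W + B/n   # pooled variance estimate
  sqrt(var_plus / W)
}
set.seed(1)
rhat(list(rnorm(2000), rnorm(2000), rnorm(2000)))  # close to 1 for well-mixed chains
```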

summary(mod1_jags_sim)
## 
## Iterations = 1001:6000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean       SD Naive SE Time-series SE
## b[1]     4.516    0.218  0.00178       0.003037
## b[2] 18723.683  543.422  4.43702       8.117981
## b[3]   -69.654 1001.964  8.18101       8.138457
## b[4]  1846.208  977.089  7.97790       8.153038
## b[5]    30.843  997.046  8.14085       8.084877
## b[6]   -16.013  999.438  8.16038       8.160779
## b[7]  3103.923  982.738  8.02402       8.470398
## b[8]   -59.084  996.006  8.13236       8.132493
## b[9]    57.976  992.552  8.10416       8.104302
## b0    4390.004  987.233  8.06072       9.364645
## sig  87117.009 1735.543 14.17065      17.676296
## 
## 2. Quantiles for each variable:
## 
##           2.5%      25%       50%       75%     97.5%
## b[1]     4.092     4.37     4.514     4.665     4.942
## b[2] 17642.480 18360.56 18726.433 19086.084 19777.445
## b[3] -1995.871  -748.86   -61.569   601.272  1917.174
## b[4]   -45.896  1192.36  1840.503  2502.362  3770.414
## b[5] -1925.142  -645.30    32.594   700.720  1985.163
## b[6] -1986.272  -699.86   -15.171   665.299  1926.755
## b[7]  1192.001  2433.99  3103.429  3767.202  5046.794
## b[8] -2021.505  -736.43   -61.910   613.709  1894.689
## b[9] -1895.159  -598.33    54.210   712.913  1978.816
## b0    2451.783  3732.17  4376.803  5052.212  6329.450
## sig  83794.508 85933.80 87096.221 88261.711 90668.208

We notice that the posterior standard deviations are quite high for the coefficients associated with the coded categorical variable HouseStyle_coded. This makes sense, since the \(x_i\) for these indicator features can only take on values of \(0\) or \(1\), so each coefficient is informed only by the subset of homes in the corresponding category.
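The hand-coded indicators above mirror the treatment (dummy) coding that base R's model.matrix applies to a factor; a toy example:

```r
# Dummy coding of a factor: each non-reference level becomes a 0/1 column.
# The reference level (first alphabetically, here "1.5Fin") is absorbed
# into the intercept, which is why it gets no column of its own.
style <- factor(c("1Story", "2Story", "1.5Fin", "2Story"))
model.matrix(~ style)
```

This matches the model above, where "1.5Fin" serves as the reference level absorbed into b0.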

Residual checks

In a Bayesian model, we have distributions for residuals, but we’ll simplify and look only at the residuals evaluated at the posterior mean of the parameters.

X = cbind(rep(1.0, data_jags$n), data_jags$LotArea,data_jags$OverallCond,data_jags$HouseStyle_coded1.5Unf,
          data_jags$HouseStyle_coded1Story,data_jags$HouseStyle_coded2.5Fin,data_jags$HouseStyle_coded2.5Unf,
          data_jags$HouseStyle_coded2Story,data_jags$HouseStyle_codedSFoyer,data_jags$HouseStyle_codedSLvl)
head(X)
##      [,1]  [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1  8450    5    0    0    0    0    1    0     0
## [2,]    1  9600    8    0    1    0    0    0    0     0
## [3,]    1 11250    5    0    0    0    0    1    0     0
## [4,]    1  9550    5    0    0    0    0    1    0     0
## [5,]    1 14260    5    0    0    0    0    1    0     0
## [6,]    1 14115    5    0    0    0    0    0    0     0
(pm_params = colMeans(mod1_jags_csim)) # posterior mean
##         b[1]         b[2]         b[3]         b[4]         b[5]         b[6] 
##     4.515934 18723.683170   -69.654493  1846.207866    30.842949   -16.012544 
##         b[7]         b[8]         b[9]           b0          sig 
##  3103.923261   -59.084345    57.976368  4390.004102 87117.009040

Create a prediction vector, \(\hat{y}\), from the model and examine the residuals.

yhat = drop(X %*% pm_params[c("b0", paste0("b[", 1:9, "]"))]) # reorder coefficients to match X's column order (intercept first)
resid = data_jags$y - yhat
plot(resid) # residuals against data index

plot(yhat, resid) # residuals against predicted values

qqnorm(resid) # checking normality of residuals

We see that the residuals are mostly concentrated about zero; however, there are a few outliers. This suggests that other features could be added to the model to reduce its prediction error.

Iterate with Another Model

As mentioned previously, we will build another linear model that incorporates the numeric feature TotRmsAbvGrd, which is the total rooms above grade (does not include bathrooms).

#Adjust the dataset to include `TotRmsAbvGrd`.
dat2=dat[,c("LotArea","HouseStyle","OverallCond","GrLivArea","TotRmsAbvGrd","SalePrice")]
dat2$HouseStyle_coded = factor(dat2$HouseStyle)
newdat2 = dat2[,c("LotArea","HouseStyle_coded","OverallCond","GrLivArea","TotRmsAbvGrd","SalePrice")]

# head(newdat2)

The JAGS model becomes:

mod2_jags_string = " model {
    for (i in 1:n) {
        y[i] ~ dnorm(mu[i], prec)
        mu[i] = b0 + b[1]*LotArea[i] + b[2]*OverallCond[i]+ b[3]*HouseStyle_coded1.5Unf[i]
        + b[4]*HouseStyle_coded1Story[i]+ b[5]*HouseStyle_coded2.5Fin[i]+ b[6]*HouseStyle_coded2.5Unf[i]
        + b[7]*HouseStyle_coded2Story[i]+ b[8]*HouseStyle_codedSFoyer[i]+ b[9]*HouseStyle_codedSLvl[i]
        + b[10]*TotRmsAbvGrd[i]
    }
    b0 ~ dnorm(0.0, 1.0/1.0e6)
    for (i in 1:10) {
        b[i] ~ dnorm(0.0, 1.0/1.0e6)
    }
    prec ~ dgamma(5/2.0, 5*10.0/2.0)
    sig2 = 1.0 / prec
    sig = sqrt(sig2)
} "
data2_jags = list(y=newdat2$SalePrice, LotArea=newdat2$LotArea,OverallCond=newdat2$OverallCond,
                 HouseStyle_coded1.5Unf=as.numeric(newdat2$HouseStyle_coded=="1.5Unf"),HouseStyle_coded1Story=as.numeric(newdat2$HouseStyle_coded=="1Story"),
                 HouseStyle_coded2.5Fin=as.numeric(newdat2$HouseStyle_coded=="2.5Fin"),HouseStyle_coded2.5Unf=as.numeric(newdat2$HouseStyle_coded=="2.5Unf"),
                 HouseStyle_coded2Story=as.numeric(newdat2$HouseStyle_coded=="2Story"),HouseStyle_codedSFoyer=as.numeric(newdat2$HouseStyle_coded=="SFoyer"),
                 HouseStyle_codedSLvl=as.numeric(newdat2$HouseStyle_coded=="SLvl"),TotRmsAbvGrd=newdat2$TotRmsAbvGrd,
                 n=nrow(newdat2)) 
params1 = c("b0","b", "sig")
inits1 = function() {
    inits = list("b0"=rnorm(1,0.0,100.0), "b"=rnorm(10,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
}
mod2_jags = jags.model(textConnection(mod2_jags_string), data=data2_jags, inits=inits1, n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1460
##    Unobserved stochastic nodes: 12
##    Total graph size: 18584
## 
## Initializing model
update(mod2_jags, 1000) # burn-in

mod2_jags_sim = coda.samples(model=mod2_jags,
                        variable.names=params1,
                        n.iter=5000)

mod2_jags_csim = do.call(rbind, mod2_jags_sim) # combine multiple chains
# gelman.diag(mod2_jags_sim)
# autocorr.diag(mod2_jags_sim)
# autocorr.plot(mod2_jags_sim)
# effectiveSize(mod2_jags_sim)
plot(mod2_jags_sim)

summary(mod2_jags_sim)
## 
## Iterations = 1001:6000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##            Mean        SD Naive SE Time-series SE
## b[1]      2.188    0.1776  0.00145       0.002544
## b[2]   6339.411  639.7799  5.22378      14.803242
## b[3]    -79.366  998.2203  8.15043       8.150603
## b[4]   1825.850  958.1720  7.82344       8.176910
## b[5]    -84.747 1001.2881  8.17548       8.050979
## b[6]   -121.467  997.8990  8.14781       8.253272
## b[7]   1653.969  964.3769  7.87410       8.029389
## b[8]    -43.810 1000.8695  8.17206       8.111534
## b[9]    -57.073  983.0139  8.02627       8.041809
## b[10] 17404.354  586.7732  4.79098      14.002838
## b0     1848.166  987.4110  8.06218       9.582944
## sig   68006.059 1310.8944 10.70341      13.345699
## 
## 2. Quantiles for each variable:
## 
##            2.5%       25%       50%       75%     97.5%
## b[1]      1.843     2.068     2.186     2.307     2.543
## b[2]   5095.853  5898.568  6329.896  6775.045  7595.199
## b[3]  -2022.736  -751.697   -89.833   600.097  1888.300
## b[4]    -13.195  1171.850  1832.692  2465.480  3720.188
## b[5]  -2027.863  -753.905   -84.041   590.717  1876.619
## b[6]  -2080.128  -790.109  -119.393   554.646  1816.294
## b[7]   -263.788  1005.980  1660.790  2301.394  3535.793
## b[8]  -1994.089  -720.581   -43.400   638.966  1884.774
## b[9]  -1985.461  -714.932   -61.421   615.326  1852.118
## b[10] 16262.382 17009.180 17405.388 17803.185 18540.086
## b0      -68.407  1184.027  1831.996  2513.628  3791.564
## sig   65463.421 67105.530 67979.246 68876.941 70658.796

The posterior mean of the coefficient for TotRmsAbvGrd, b[10], is quite large compared with the other coefficients: each additional room above grade is associated with an increase of roughly $17,400 in predicted sale price, holding the other features fixed (though raw coefficient magnitudes are only loosely comparable across features on different scales). As with the previous model, the trace plots show convergence for all the coefficients, and the posterior densities are approximately normal.

Let’s check the residuals for this model.

X = cbind(rep(1.0, data2_jags$n), data2_jags$LotArea,data2_jags$OverallCond,data2_jags$HouseStyle_coded1.5Unf,
          data2_jags$HouseStyle_coded1Story,data2_jags$HouseStyle_coded2.5Fin,data2_jags$HouseStyle_coded2.5Unf,
          data2_jags$HouseStyle_coded2Story,data2_jags$HouseStyle_codedSFoyer,data2_jags$HouseStyle_codedSLvl,data2_jags$TotRmsAbvGrd)
head(X)
##      [,1]  [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,]    1  8450    5    0    0    0    0    1    0     0     8
## [2,]    1  9600    8    0    1    0    0    0    0     0     6
## [3,]    1 11250    5    0    0    0    0    1    0     0     6
## [4,]    1  9550    5    0    0    0    0    1    0     0     7
## [5,]    1 14260    5    0    0    0    0    1    0     0     9
## [6,]    1 14115    5    0    0    0    0    0    0     0     5
(pm_params = colMeans(mod2_jags_csim)) # posterior mean
##         b[1]         b[2]         b[3]         b[4]         b[5]         b[6] 
##     2.187958  6339.411218   -79.365965  1825.849805   -84.746797  -121.467282 
##         b[7]         b[8]         b[9]        b[10]           b0          sig 
##  1653.969007   -43.809989   -57.072815 17404.354319  1848.166374 68006.059051

Create a prediction vector, \(\hat{y}\), from the model and examine the residuals.

yhat = drop(X %*% pm_params[c("b0", paste0("b[", 1:10, "]"))]) # reorder coefficients to match X's column order (intercept first)
resid = data2_jags$y - yhat
plot(resid) # residuals against data index

plot(yhat, resid) # residuals against predicted values

qqnorm(resid) # checking normality of residuals

We see that these plots are very similar to those generated by the first model; however, the magnitude of the residual outliers for the second model is much smaller. We can conclude that adding the additional feature has indeed improved the model's fit.

Comparison of DIC Metrics for Both Models

Let’s compare the two models using the DIC metric:

dic.samples(mod1_jags, n.iter=1e3)
## Mean deviance:  37363 
## penalty 3.182 
## Penalized deviance: 37366
dic.samples(mod2_jags, n.iter=1e3)
## Mean deviance:  36640 
## penalty 3.599 
## Penalized deviance: 36643

Upon examination of the deviance information criterion (DIC), the penalized deviance of the second model (which incorporates TotRmsAbvGrd) is lower than that of the first. We conclude that the second model, with the additional feature, is a better fit for (and predictor of) the data. This conclusion is consistent with the large magnitude of the b[10] coefficient.
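As a final arithmetic check, the penalized deviances reported above are simply the posterior mean deviance plus the effective-parameter penalty:

```r
# DIC = mean deviance + penalty (pD); component values taken from the
# dic.samples output above.
dic1 <- 37363 + 3.182   # model 1 (3 features)
dic2 <- 36640 + 3.599   # model 2 (adds TotRmsAbvGrd)
c(model1 = dic1, model2 = dic2, improvement = dic1 - dic2)
```

The second model improves the penalized deviance by roughly 723, a substantial margin on the DIC scale.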